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1.1 Introduction 

Markov chain Monte Carlo (MCMC) methods have been around for almost as 
long as Monte Carlo techniques, even though their impact on statistics has not 
been truly felt until the very early 1990s, except in the specialized fields of spa- 
tial statistics and image analysis, where those methods appeared earlier. The 
emergence of Markov based techniques in physics is a story that remains un- 
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Landau and Binder 



20051) . Also, we will not enter 



told within this survey (see 
into a description of MCMC techniques, unless they have some historical link 
as the remainder of this volume covers the technical asp ects. A comprehen- 
sivc treatment with further references can also be found in Robert and Casella 



( 20041) . 

We will distinguish between the introduction of Metropolis-Hastings based 
algorithms and those related to Gibbs sampling, since they each stem from rad- 
ically different origins, even though their mathematical justification via Markov 
chain theory is the same. Tracing the development of Monte Carlo methods, 
we will also briefly mention what we might call the "second-generation MCMC 
revolution" . Starting in the mid-to-late 1990s, this includes the development of 
particle filters, reversible jump and perfect sampling, and concludes with more 
current work on population or sequential Monte Carlo and regeneration and the 
computing of "honest" standard errors. 

As mentioned above, the realization that Markov chains could be used 
in a wide variety of situa tions only came (to mainstream statisticians) with 



Gelfand and Smith! (Il990l) , despite earlier publi c ation s in t he statistical liter- 



ature like iHastingd (|l970l ). iGeman and Gemanl (J1984J) and iTanner and Wong 
( 19871 ). Several reasons can be advanced: lack of computing machinery (think 
of the computers of 1970!), or background on Markov chains, or hesitation to 
trust in the practicality of the method. It thus required visionary researchers 
like Gelfand and Smith to convince the community, supported by papers that 
demonstrated, through a series of applicat ions, that the method was easy to un- 



derstand, easy to implement and practical ([Gelfand et al. Il990i 



1992, 



Smith and Gelfand 



1992, 



Wakefield et al 



1994j ). The rapid emergence of the dedicated BUGS 



(Bayesian inference Using Gibbs Sampling) software as early as 1991, when a 
paper on BUGS was presented at the Valencia meeting, was another compelling 
argument for adopting, at large, MCMC algorithms^ 



I. 2 Before the Revolution 

Monte Carlo methods were born in Los Alamos, New Mexico during World War 

II, eventually resulting in the Metropolis algorithm in the early 1950s. While 
Monte Carlo methods were in use by that time, MCMC was brought closer to 

J Historically speaking, the development of BUGS initiated from lGeman and Gema r] (I1984T ) 
and lPea rl ( 1987), in accord with the de velopments in the artificial intelligence community, and 
it pre-dates Gelfand and Smith (1990). 
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statistical practicality by the work of Hastings in the 1970s. 

What can be reasonably seen as the first M CMC algorithm i s wha t we 
now call the Metropolis algorithm, published by IMetropolis et al.l (| 1953T ) . It 
emanates from the same group of scientists who produced the Monte Carlo 
method, namely the research scientists of Los Alamos, mostly physicists work- 
ing on mathematical physics and the atomic bomb. 

MCMC algorithms therefore date back to the same time as the development 
of regular (MC only) Monte Carlo methods, which are usually traced to Ulam 
and von Neumann in the late 1940s. Stanislaw Ulam associates the original 
idea with an intractable combinatorial computation he attempted in 1946 (cal- 
culating the probability of winning at the card game "solitaire" ) . This idea was 
enthusiastically adopted by John von Neumann for implementation with direct 
applications to neutro n diffusion, the name "Monte Carlo" being suggested by 
Nicholas Met ropolis. (Eckhardtlll987l describes these early Monte Carlo devel- 
opments, and lHitchcockll2003l gives a brief history of the Metropolis algorithm.) 



These occurrences very closely coincide with the appearance of the very first 
general-purpose digital computer, the ENIAC, which came to life in February 
1946, after three years of construction. The Monte Carlo method was set up by 
von Neumann, who was using it on thermonuclear and fission problems as early 
as 1947. At the same time, that is, 1947, Ulam and von Neumann invented inver- 
sion and accept-reject techniques (also recounted in '. 



Eckhardt 



19871) to simulate 



from non-uniform distributions. Without computers, a rudi mentary version in - 
vented by Fermi in the 1930s did not get any recognition ([Metropolis! 119871 1. 
Note also that, as early as 1949, a symposium on Monte Carlo was supported 
by Ra nd, NBS and the Oak Ridge laboratory and that IMetropolis and Ulam 
( 19491 ) published the very first paper about the Monte Carlo method. 



1.2.1 The Metropolis et al. (1953) paper 

The first MCMC algorithm is associated with a second computer, called MA- 
NIAC 0, built in Los Alamos under the direction of Metropolis in early 1952. 
Both a physicist and a mathematician, Nicholas Metropolis, who died in Los 
Alamos in 1999, came to this place in April 1943 . The other members of the 
team also came to Los Alamos during those years, including the controversial 
Edward Teller. As early as 1942, this physicist became obsessed with the hy- 
drogen (H) bomb, which he eventually managed to design with Stanislaw Ulam, 

4 MANIAC stands for Mathematical Analyzer, Numerical Integrator and Computer. 
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using the better computer facilities in the early 1950s. 

Publ ished in June 19 5 3 in the Journal of Chemical Physics, the primary 
focus of 



Metropolis et al.l (|1953l ) is the computation of integrals of the form 



3 = J F{6) exp{-E(6)/kT}d6 j J cxp{~E(6)/kT}c\6 , 

on M. 2N , 8 denoting a set of TV particles on R 2 , with the energy E being defined 

as 

N N 

1=1 3 = 1 

where V a potential function and the Euclidean distance between particles 
i and j in 6. The Boltzmann distribution exp{—E(6)/kT} is parameterized by 
the temperature T, k being the Boltzmann constant, with a normalization factor 



Z(T) = J exp{-E(6)/kT}d6, 



that is not available in closed form, except in trivial cases. Since 9 is a 2N- 
dimensional vector, numerical integration is impossible. Given the large dimen- 
sion of the problem, even standard Monte Carlo techniques fail to correctly 
approximate 3, since exp{— E(9)/kT} is very small for most realizations of the 
random configurations of the particle system (uniformly in th e 2N square). In 



order to improve the efficiency of the Monte Carlo method, Metropolis et al 



( 19531 ) propose a random walk modification of the N particles. That is, for each 



particle i (1 < i < N), values 

x' i = Xi + a^u and y ■ = y. t + a&i 

are proposed, where both £u and £21 are uniform U{—1, 1). The energy difference 
AE between the new configuration and the previous one is then computed and 
the new configuration is accepted with probability 

min{l,cxp(-A£;/fcT)} , (1.1) 

and otherwise the previous configuration is replicated, in the sense that its 
counter is increased by one in the final aver age of the F(6t)'s over th e r moves 
of the random walk, 1 < t < r). Note that Metropolis et al.l (|1953l ) move one 
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particle at a time, rather than moving all of them together, which makes the 
initial algorithm appear as a primitive kind of Gibbs sampler! 

The authors of lMetropolis et aL ( 1953 ) demonstrate the validity of the algo- 
rithm by first establishing irrcducibility, which they call ergodicity, and second 
proving ergodicity, that is, convergence to the stationary distribution. The sec- 
ond part is obtained via a discretization of the space: They first note that the 
proposal move is reversible, then establish that exp{— E/kT} is invariant. The 
result is therefore proven in its full generality, minus the discretization. The 
number of iterations of the Metropolis algorithm used in the paper seems to be 
limited: 16 steps for burn-in and 48 to 64 subsequent iterations, which required 
four to five hours on the Los Alamos computer. 

An interesting variat ion is the Simulated Annealing algorithm, developed by 



Kirkpatrick et alJ (|1983h , who connected optimization with annealing, the cool- 



ing of a metal. Their variation is to allow the temperature T in (jl.l[) to decrease 
as the algorithm runs, according to a "cooling schedule". The Simulated An- 
nealing algorithm can be shown to find the global maximum with probability 
1, although the analysis is quite complex due to the fact that, with varying T, 
the algorithm is no longer a time-homogeneous Markov chain. 



1.2.2 The Hastings (1970) paper 



The Metropolis algorithm was later generalized bv iHastinga (|1970f ) and his stu- 



dent iPeakwy (|l973lll98lh as a statistical simulation tool that could overcome the 
curse of dime nsionality met by regul ar Monte Carlo methods, a point already 



emphasized inlMetropolis et al.lll95 



In his Biometrika paper H lHastingsl <| 1970! ) also defines his methodology for 
finite and reversible Markov chains, treating the continuous case by using a 
discretization analogy. The generic probability of acceptance for a move from 
state i to state j is 

an = 



1 



where s,j = Sji is a positive quantity ensuring that ay < 1, 7r< denotes the 
target and qtj the proposal. This generic form of probability encompasses the 



5 In fact, Hastings starts by mentioning a decomposition of the target distribution into a 
product of one- dimensional conditional distributions but this falls short of an early Gibbs 



sa mpler. 

^Hastings! <197Cft 
(l200lfT 



is one 



of the ten Biometrika papers reproduced in Tittcrington and Cox 
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forms of both 



Metropolis et al.l (|1953l ) and iBarkerl (| 19651) . At this stage, Hast- 



ings mentions that little is known about the relative merits of those two choices 
(even though) Metropolis's method may be preferable. He also warns against 
high rejection rates as indicative of a poor choice of transition matrix, but does 
not mention the opposite pitfall of low rejection rates, associated with a slow 
exploration of the target. 



The examples in the paper include a Poisson target with ail random walk 
proposal and a normal target with a uniform random walk proposal mixed 
with its reflection, i.e. a uniform proposal centered at —9t rather than at the 
current value 6t of the Markov chain. On a multivariate target, Hastings in- 
troduces a Gibbs sampling strategy, updating one component at a time and 
defining the composed transition as satisfying t he station a ry co ndition because 
ea ch component does le ave the target invariant. lHastingsl (|l970( ) actually refers 



to 



Erhman et al 



()1960l ) as a preliminary, if specific, instance of this sampler. 
More precisely, this is Metropolis-within-Gibbs except for the name. This first 
introduction of the Gibbs sampler has thus been completely overlooked, even 
though the pro of of convergenc e is completely general, based on a composition 
argument as in iTiernevI ( 1994 ) , discussed in Section 11.4.11 The remainder of 
the paper deals with (a) an importance sampling version of MCMC, (b) gen- 
eral remarks about assessment of the error, and (c) an application to random 
orthogonal matrices, with another example of Gibbs sampling. 



Three years later, iPeskunl (|l973l) published a comparison of Metropolis' and 
Barker's forms of acceptance probabilities and showed in a discrete setup that 
the optimal choice is that of Metropolis, where optimality is to be understood 
in terms of the asymptotic varia nce of any empirica l aver age. The proof is a 
direct consequence of a result by iKemenv and Snelll (|1960l ) on the asymptotic 
variance. Peskun also establishes that this asymptotic variance can improve 
upon the iid case if and only if the eigenvalues of P — A arc all negative, when 
A is the transition matrix corresponding to iid simulation and P the transition 
matrix corresponding to the Metropolis algorithm, but he concludes that the 
trace of P — A is always positive, therefore that the uniform improvement is 
impossible. 
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1.3 Seeds of the Revolution 



A number of early pioneers had brought forward the seeds of Gibbs sampling; in 
particular, Hammersley and Clifford had produced a constructive argument in 
1970 to recover a joint distributio n from i ts condition als, a resul t later cal l ed the 
Ha mmersley-Clifford theorem by iBesad (| 1974L 1 19861 ) . Besides iHastingsl (|1970l ) 



Geman and Gemanl (|1984f ) , already mentioned, o ther papers that contained 



and 



the se e ds of Gibbs sampling ar e 



Besag and Clifford (198 



( 1984lUQian and Titteringtonl (|1990l ). and lTanner and Wongj <jl987l) 



rp£ 

3l 



Broniatowski et al 



1.3.1 Besag and the Fundamental (Missing) Theorem 

In the early 1970's, Hammersley, Clifford, and Besag were working on the spec- 
ification of joint distributions from conditional distributions and on necessary 
and sufficient conditions for the conditional distributions to be compatible with 
a joint distribution. What is now known as the Hammersley-Clifford theorem 
states that a joint distribution for a vector associated with a dependence graph 
(edge meaning dependence and absence of edge conditional independence) must 
be represented as a product of functions over the cliques of the graphs, that 
is, of functions depending only on the components indexed by the labels in the 
cliqujll. 



From an historical point of view, iHammerslevi ()1974f ) explains why the 
Hammersley- Clifford theorem was never published as such, but only through 
Besad <| 19741 ) . The reason is that Clifford and Hammersley were dissatisfied 
with the positivity constraint: The joint density could be recovered from the 
full conditionals only when the support of the joint was made of the product 
of the supports of the full conditionals. While they strived to make the theo- 
rem independent of any positivity condition, their gradu ate student publis hed a 
counter-e xample t hat p ut a full stop to their endeavors (jMoussourial 1 9 74T ) . 



While 



Besaa (|l974j ) can cer tainly be cre dited to some extent of the (re 



)discovery of the Gibbs sampler. IBesad (119751 ) expressed doubt about the prac- 
ticality of his method, noting that "the technique is unlikely to be particularly 
helpful in many other than binary situations and the Markov chain itself has no 

practical interpretation" , clearly understating the imp ortance of his work. 

A more optimistic sentiment was expressed earlier bv lHammerslev and Handscomb 



(jl964l ). in their textbook on Monte Carlo methods. There they cover such 



7 A clique is a maximal subset of the nodes of a graphs such that every pair of nodes within 
the clique is connected by an edge {Crcssic 199j|). 
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topics as "Crude Monte Carlo"; importance sampling; control variates; and 
"Conditional Monte Carlo" , which looks surprisingly like a simulation approach 
to missing-data models (see Section 11.3. 2[) . Of course, they do not cover the 
Hammcrsley-Clifford theorem but they state in the Preface: "We are convinced 
nevertheless that Monte Carlo methods will one day reach an impressive matu- 
rity. " Well said! 



1.3.2 EM and its Simulated Versions as Precursors 



Due t o its connection with missing data problems, the EM algori thm (IDempster et al 
1977) has e arly connections with Gibb s sampling^ For instance. iBroniatowski et ; 
and lCeleux and Dieboltl (jl985l ) had tried to overcome the dependence of 



EM methods on the starting value by replacing the E step with a simulation 
step, the missing data z m being generated conditionally on the observation x 
and on the current value of the parameter 6 m . The maximization in the M step 
is then operated on the simulated complete-data likelihood, L(6\x, z rn ), produc- 
ing a new value 9 m+ i an d thi s appears as a predecessor to the Gibbs step of 



Gelman and King! (| 1990T ) and iDiebolt and Robertl (|1994j ) for mixture estima- 
tion^ Unfortunately, the th e oretic al convergence results for these methods are 



limited. ICeleux and Dieboltl (| 1990T) have, however, solved the convergence prob- 
lem of SEM by devising a hybrid version called SAEM (for Simulated Annealing 
EM), where the amount of randomness in the simulations decreases with the 
iterations, ending up with an EM algorithm^ 



1.3.3 Gibbs, and Beyond 

Although somewhat removed from statistical inference in the classical sense and 
based on earlier technique s used in statistical physics, the landmark paper by 



Geman and Gemanl (|1984l ) brought Gibbs sampling into the arena of statistical 
application. This paper is also responsible for the name Gibbs sampling, because 
it implemented this method for the Bayesian study of Gibbs random fields which, 



8 This is especially relevant when cons i dering the early introduction of a Gibbs sampler by 
data augmentation in [Tanner and Wonel l|l987l V 

9 The achievement in the former paper remained unoticed for several years due to the low- 
key and off-handed use of the Gibbs sampler at a time when it was unknown to most of the 
community. 

10 Other and m ore well-known connections between EM and MCMC algorithms can be found 
in the literature llLiu and Rubinlll994l . lMeng and Rubiniri99^ . IWei and Tanned[l990T ), but the 
connection with Gibbs sampling is more tenuous in that the simulation methods there are 
used to approximate quantities in a Monte Carlo fashion. 
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in turn, derive their name from the physicist Josiah Willard Gibbs (1839-1903). 
This original implementation of the Gibbs sampler was applied to a discrete 
image processing problem and did not involve completion as in Section 11.3.21 
But this was one more spark that led to the explosion, as it had a clear influence 
on Green, Smith, Spiegelhalter and others. 

The extent to which Gibbs sampling and Metropolis algorithms were in 
use within the image analysis and p oint process communities is actually quite 
large, as illustrated in Irliplev (1987) where Section §4.7 is entitled "Metropolis' 
method and random fields" and describes the implementation and the validation 
of the Metropolis algorithm in a finite setting with an application to Markov 
random field s and t he corresponding issue of bypassing the normalizing constant. 



Besag et al.1 (|1991l ) is another striking example of the activity in the spatial 



statistics community at the end of the 1980's. 



1.4 The Revolution 



Gelfand and Smith 



The g ap of more than 30 years between lMetropolis et al.1 (|1953l ) and[| 
([1990) can still be partially attributed to the lack of appropriate computing 
power, as most of the examples now processed by MCMC algorithms could not 
ha ye been treated previous ly, even though the hundreds of dimensions processed 



Metropolis et al.l (jl953l) were quite formidable. However, by the mid-1980s, 



the pieces were all in place. 

After Pcskun, MCMC in the statistical world was dormant for about 10 
years, and then several papers appeared that highlighted its usefulness in spe- 
cific settin gs like pattern recognition , image analysis or spatial statistics. In 
particular, iGeman and Gemanl (|1984j ) influenced Gelfand and Smith (1990) to 
write a paper that is the genuine starting point for an intensive use of MCMC 
methods by the mainstream statistical community. It sparked new interest in 
Bayesian methods, statistical computing, algorithms, and stochastic processes 
through the use of computing algorithms such as the Gi bbs sampler and the 
Metropolis-Hastings algorithm. ICasella and Georgelll992l wrote an elementary 
introduction to the Gibbs sampleJ 11 ! in American Statistician that disseminated 
the technique to a wider community while explaining in simple terms why the 



11 On a humorous note, the original Technical Report of this paper was called Gibbs for Kids, 
which was changed because a referee did not appreciate the humor. However, our colleague 
Dan Gianola, an Animal Breeder at Wisconsin, liked the title. In using Gibbs sampling in his 
work, he gave a presentation in 1993 at the 44th Annual Meeting of the European Association 
for Animal Production, Arhus, Denmark. The title: Gibbs for Pigs. 
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algorithm is valid. 



Interestingly, the ea rlier paper bv iTanner and Wond (|1987l ) had essentially 
the same ingredients as belfand and Smithl il99olT namely the fact that simu- 
lating from the conditional distributions is sufficient to asymptotically simulate 
from the joint. This paper was considered important enough to be a discussion 
paper in the Journal of the Ameri can Statistical Association , but its impact was 
somehow limited, compared with lGelfand and Smithl (|1990I ). There are several 
reasons for this; one being that the method seemed to only apply to missing data 
problems, this impression being reinforced by the name data augmentation, and 
another is that the authors were more focused on approximating the posterior 
distribution. They suggested a MCMC approximation to the target tt(0\x) at 
each iteration of the sampler, based on 



1 



k=l 



J.A- 



x t -i(z\x), k 



that is, by replicating m times the simulations from the current approxima- 
tion Tft-i(z\x) of the marginal posterior distribution of the missing data. This 
focus on estimation of the posterior distribution connected the original Data 
Augmentation algorithm to EM, as pointed out by Dempster in the discussion. 
Although the discussion by Morris gets very close to the two-stage Gibbs sam- 
pler for hierarchical models, he is still concerned about doing m iterations, and 
worries about how costly that would be. Tanner and Wong mention taking 
m = 1 at the end of the paper, referring to this as an "extreme case" . 

In a sense 



Tanner and Wong (|1987T ) was still too close to Rubin's 



1978 



multiple imputation to start a new revolution. Yet another reason for this may 
be that the theoretical background was based on functional analysis rather than 
Markov chain theory, which needed, in particular, for the Markov kernel to be 
uniformly bounded and cquicontinuous. This may have discouraged potential 
users as requiring too much mathematics. 

The authors of this review were fortunate enough to attend many focused 
conferences during this time, where we were able to witness the explosion of 
Gibbs sampling. In the summer of 1986 in Bowling Green, Ohio, Smith gave a 
series of ten lectures on hierarchical models. Although there was a lot of comput- 
ing mentioned, the Gibbs sampler was not fully developed yet. (Interestingly, 
Smith commented that the limiting factor, at that time, for the full exploitation 
of hierarchical models in statistical problem, was the inability to compute high- 
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dimensional integrals.) In another lecture in June 1989 at a Bayesian workshop 
in Sherbrookc, Quebec, he revealed for the first time the generic features of 
Gibbs sampler, and we still remember vividly the shock induced on ourselves 
and on the whole audience by the sheer breadth of the method: This develop- 
ment of Gibbs sampling, MCMC, and the resulting seminal paper of Gelfand 
and Smith (1990) was an epiphany in the world of statistics. 

Definition: epiphany n. A spiritual event in which the essence of a given object 
of manifestation appears to the subject, as in a sudden flash of recognition. 

The explosion had begun, and just two years later, an MCMC conference at 
Ohio State University organized by Gelfand, Goel, and Smith, consisted of three 
full days of talks. Many of the talks were to become influen t ial pa p ers; includin g 



Albert and Chib dl993h . 



Liu et al 



.'I'll. 



Gc 



1993 1. and 



Tiernevl (jl994l ). 



man and Rubml (jl992(UGeyerl l|l992t ). lGilksl 1 1993 1 



Approximately one year later, in May of 1992, there was a meeting of the 
Royal Statistical Society on "The Gibbs sampler and other Markov chain Monte 
Carlo methods" , where four papers were presented followed by much discussion. 
The papers appear in the first volume of JRSSB in 1993, together with 49 (!) 
pages of discussion. The excitement is clearly evident in the writings, even 
though the theory and implementation were not always perfectly understood. 

Looking at these meetings, we can see the paths that Gibbs sampling would 
lead us down. In the next two sections we will summarize some of the advances 
from the early to mid 1990s. 



1.4.1 Advances in MCMC Theory 



Perhaps the most influential MCMC theory paper of the 1990s is lTiernevi (|l994f) . 
who carefully laid out all of the assumptions needed to analyze the Markov 
chains and then developed their properties, in particular, convergence of ergodic 
averages and central li mit theorems. In one of the discussions of that paper, 
Chan and Geverl (|1994T ) were able to relax a condition on Tierney's Central Limit 
Theorem, and this new condition plays an important role in research today (see 
Se ction 11.5.41). A pair o f very infiuc itial, and innovative, papers is the work 



of 



Liu et al 



1994 



1995T ). who very carefully analyzed the covariance structure 



of Gibbs sampling, and were able to formally establish the validity of Rao- 
Blackwellization in Gibbs sampling. Gelfand and Smith (1990) had used Rao- 
Blackwellization, but it was not justified at that time, as the original theorem 
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was only applicabl e to iid sampling , which is not the case in MCMC. Another 
significant entry is lRosenthall (|19951 ). who obtained one of the earliest results on 

exact rates of convergence. 

Another paper must be singled out, namely iMengersen and Tweedid (|1996l ). 
for setting the tone for the study of the speed of convergence of MCMC algo- 
rithms to the target distribution. Subsequent works in this area by Richard 
Tweedie, Gareth Roberts, Jeff Rosenthal and c o-authors are too nu merous to 

(119971 ) must be 



Roberts et al 



be mentioned here, even though the paper by 
cited for setting explicit targets on the acceptance rate of the rando m walk 
Metropolis-Hastings algorithm, as well as lRoberts and Rosenthall (|1999l ) for get- 
ting an upper bound on the number of iterations (523) needed to approximate 
the target up to 1% by a slice sampler. The untimely death of Richard Tweedie 
in 2001 alas had a major impact on the book about MCMC convergence he was 
contemplating with Gareth Roberts. 

One pitfall arising from the widespread use of Gibbs sampling was the ten- 
dency to specify models only through their conditional distributions, almost 
always without referring to the positivity conditions in Section 11.31 Unfortu- 
nately, it is possible to specify a perfectly legitimate-looking set of conditionals 
that do not corre spond to any joint d i stribu tion, and the resulting Gibbs chain 



cannot converge. iHobert and Casellal (|l996l ) were able to document the condi- 
tions needed for a convergent Gibbs chain, and alerted the Gibbs community 
to this problem, which only arises when improper priors are used, but this is a 
frequent occurrence. 



Much other work followed, and continues to grow today. iGever and Thompson 



(|1995l) describe how to put a "ladder" of ch ains t ogether to have both "hot" 
and "cold" explorat ion, followed by Neal's Il996l introduction of tempering; 



Athreva et al. ( 1996) gav e mo re easily ver i fiable conditions for convergence; 



Meng and van Dvkl (|1999l ) and iLiu and Wul (|1999h developed the theory of pa- 



rameter expansion in the Data Augmentation algorithm, leading to construe- 



Hobert and Marchev 



tionof chains with faster convergence, and to the work of 
(J2008), who give precise constructions and theorems to show how parameter 
expansion can uniformly improve over the original chain. 



1.4.2 Advances in MCMC Applications 



The real reason for the explosion of MCMC methods was the fact that an enor- 
mous number of problems that were deemed to be computational nightmares 
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now cracked open like eggs. As an example, consider this very simple random 
effects model from Gclfand and Smith (1990). Observe 



y. 



1, 



J 



1, 



J, 



(1.2) 



where 



N(0,(Tg), independent of i 



Estimation of the variance components can be difficult for a frequentist (REML 
is typically preferred) but it indeed was a nightmare for a Bayesian, as the 
integrals were intractable. However, with the usual priors on fx, erg, and cr^, the 
full conditionals are trivial to sample from and the problem is easily solved via 
Gibbs sampling. Moreover, we can increase the number of variance components 
and the Gibbs solution remains easy to implement. 

During the early 1990s, researchers found that Gibbs, or Metropolis-Hastings, 
algorithms would crack almost any problem that they looked at, and there was 
a veritable flood of papers applying MCMC to previously intractable models, 
and getting good solutions. For example, building on (|1.2j) . it was quickly re- 
alized that Gi bbs sampling was an eas y route to getting estimates in the linear 



mix ed models (Wang et al 



1993J, [1994) , and even generalized linear mixed mod- 



els (|Zeger and Karimlll991l ). Building on the experience gained with the EM 
algorithm, similar arguments made it possible to a nalyze probit models u sing a 
latent variable approach in a linear mix ed model (Albert and Chiblll993l) . and 
in mixture models with Gibbs sampling (jDiebolt and Robertlll994h . It progres- 
sively dawned on the community that latent variables could be artificially intro- 
d uced to run the Gibbs sampler in about every situation, as eventual ly publishe d 



Damien et al 



(|1999i ). the main example being the slice sampler (|Nealll2003f) 



A very incomplete list of some oth er applicatio ns include changepoint analysis 



( Carlin et al 



1992 



Stephens 



1993 l lStephens and Smit 



ithll993h : 



19941 ): Genomics dChurchill 



19921 ) ; var iable selection in regressio n (jGeorge and McCullochl 11993 ) ; spatial 



capture -recapture iPupuisI 



1995 



Lawrence et al 



1995 



George and Robert 



statis tics (jRafterv and Banfieldl 119911 ). and longitudinal studies (jLange et al 
1992). 



Many of these applications were a dvanced though other developm ents such 



as the Adaptive Rejection Sampling of lGilka (|1992l ) 



Gilksetal 



(|1995I ). and the 
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simulated tempering approaches of lGever and Thompson! (|1995l ) or lNeall (|l996T ) 



1.5 After the Revolution 



After the revolution comes the "second" revolution, but now we have a more 
mature field. The revolution has slowed, and the problems are being solved 
in, perhaps, deeper and more sophisticated ways, even though Gibbs sampling 
also offers to the amateur the possibility to handle Bayesian analysis in complex 
models at little cost, as exhibited by the widespread use of BUGS, which mostly 
focused on this approach. But, as before, the methodology continues to expand 
the set of problems for which statisticians can provide meaningful solutions, and 
thus continues to further the impact of statistics. 



1.5.1 A Brief Glimpse at Particle Systems 

The realization of the possibilities of iterating importance sampling is not new: 
in fact, it is about as old as Monte Carlo methods th emselves. It can be found 
in the molecular simulation literatur e of the 50's, as in Hammcrs ley and Morton 
( 1954h . iRosenbluth and Rosenbluthl |l955l) and iMarshall ( 19651) . ] lammerslcy 



and colleagues proposed such a method to simulate a self-avoiding random walk 
(see Madras and Sladd Il993f ) on a grid, due to huge inefficiencies in regular 



importance sampling and rejection techniques. Although this early implemen- 
tation o ccurred in parti cle phys ics, the use of t h e term "particle" only dates 
back to Kitagawal (|1996l ). while ICarpenter et al.l (|1997l ) coined the term "par- 



ticle filter" . I n signal processing, ear l y occ urrences of a particle filter can be 
traced back to 



Handschin and Mavnel (|1969l ). 



More in connection with our theme, the landmark paper of 



Gordon et al 



( 19931 ) introduced the bootstrap filter which, while formally connected with im- 



porta nce sampling, involves past simulations and possible MCMC steps (jGilks and Berzuini 



200lh . As described in the volume edited bv lDoucet et al.l (|200ll ). particle filters 
are simulation methods adapted to sequential settings where data are collected 
progressively in time as in radar detection, telecommunication correction or fi- 
nancial volatility estimation. Taking advantage of state-space representations 
of those dynamic models, particle filter methods produce Monte Carlo approxi- 
mations to the posterior distributions by propagating simulated samples whose 
weights are actualized against the incoming observations. Since the importance 



12 BUGS now uses both Gibbs sampling and Metropolis-Hastings algorithms. 
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weights have a tendency to degenerate, that is, all weights but one are close to 
zero, additional MCMC steps can be introduced at times to recover the variety 
and representativeness of the sample. Modern connections with MCMC in the 
constr uction of the proposal kernel a re to be found, for instance, in 



Doucet et al 



( 2000 ) and in Del Moral et al.l (|2006l) . In parallel, sequent ial imputation was de- 
veloped in lKong et al. (1994), while Liu and Chen (1995) first formally pointed 
out the importance of resampling in sequential Monte Carlo, a term coined by 
them. 

The recent literature on the topic more closely bridges the gap between se- 
quential Monte Carlo an d MCMC methods b y making adaptive MCMC a po s- 
sibility (see, for example, 



Andrieu et al 



2004 or 



Roberts and Rosenthal 



2005) 



1.5.2 Perfect sampling 



Introduced in the seminal paper of iPropp and Wilson! (|1996I ) , perfect sampling, 
namely the ability to use MCMC methods to produce an exact (or perfect) 
simulation from the target, maintains a unique place in the history of MCMC 
methods. Although this exciting discovery led to an outburst of papers, in 
pa rticular in the large body of work o f M0ller and coauthors, including the book 
bv lMoller and Waagepetersen (120031) , as well as man y revi ews and i n trodu ctory 



materials, like 



Casella et al 



(|200ll ). iFismenl (jl998T ). and iDimakosI (|200ll ). the 
excitement quickly dried out. The major reason for this ephemeral lifespan is 
that the construction of perfect samplers is most often cl ose to impo ssible or 
impractical, despite some advances in the implementation ( Fill 1998al fb). 



There is, however, ongoing activity in the area of point processes an d stochas 



tic geo metry, much from the work of M0ller and Kendall. In particular. iKendall and Moller 
(|2000j ) dev eloped an alternative to t he Coupling From The Past (CFTP) al- 
gorithm of IPropp and Wilson! (|1996l) , called horizontal CFTP, which mainly 
applies to point processes and is based on continuous time birth-and-dcath pro- 
cesses. See also 



Fernandez et al 



( 1999 ) for anoth er horizontal CFTP algorithm 



for point processes. iBerthelsen and Mollerl (|2003l ) exhibited a use of these algo- 
rithms for nonparametric Bayesian inference on point processes. 



1.5.3 Reversible jump and variable dimensions 



Green 



From many viewpoints, the invention of the reversible jump algorithm in 
(|1995h can be seen as the start of the second MCMC revolution: the formal- 
ization of a Markov chain that moves across models and parameter spaces al- 
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lowed for the Bayesian processing of a wide variety of new models and con- 
tributed to the success of Bayesian model choice and subsequently to its adop- 
tion in other fields. There exist earlier alternative Monte Carlo solutions like 



Gelfand and Devi (119941 ) and lCarlin and Chibl (|1995l ). the later being very close 



in spirit to revers ible jump MCMC (as shown by the completion scheme of 



Brooks et al 



20031 ). but the definition of a proper balance condition on cross- 



model Markov kernels in iGreenl (|1995| ) gives a generic setup for exploring vari- 
able dimension spaces, even when the number of models under comparison is 
infinite. The impact of this new idea was clearly perceived when looking at the 
First European Conference on Highly Structured Stochastic Systems that took 
place in Rcbild, Denmark, the next year, organized by Stephen Lauritzen and 
Jcspcr M0ller: a large majority of the talks were aimed at direct implementa- 
tions of RJMCMC to various inference problems. The a pplication of RJMCMC 
to mix ture order estimation in the discussion paper of 



Richardson and Green 



1997 ) ensured further dissemination of the technique. More recently, IStephens 



200C ) proposed a contin uous time version of RJMCM C, based on earlier i deas of 



Gever and Moller | (|1994f) . but with similar properties (jCappe et al.ll20031 ). while 



Brooks et al.1 ( 2003 ) made proposals for increasing the efficiency of the moves. 



In retrospect, while reversible jump is somehow unavoidable in the processing 
of very la rge numbers of models u nder comparison, as for instance in variable 



selection (Marin and Robert 



20071 ) , the implementation of a complex algorithm 



like RJMCMC for the comparison of a few models is somewhat of an overkill 
since there e xist alternative s olutions based on model specific MCMC chains, 
for example (jChen et alJl2000| ). 



1.5.4 Regeneration and the CLT 

While the Central Limit Theorem (CLT) is a central tool in Monte Carlo con- 
vergence assess ment, it s use i n MCMC setups took longer to emerge, despite 
early signals by iGeverl (119921 ). and it is only recently that sufficiently clear 
conditions emerged. We recall that the Ergodic Theorem (see, for example, 



Robert and Casella 



2004L Theorem 6.63) states that, if (9t)t is a Markov chain 



with stationary distribution ir, and h(-) is a function with finite variance, then 
under fairly mild conditions, 



lim h Tl 

n~ y oo 



h(6)w(0) d0 = E„[h(9)], 



(1.3) 
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almost everywhere, where h n = (l/n-)X)"=i F° r the CLT to be used to 

monitor this convergence, 



^j(h n -E v [h{6)]) 
y/Vax[h(0)] 



->N(0,1), 



(1.4) 



there are two roadblocks. First, convergence to normality is strongly affected by 
the lack of independence. To get CLTs for Markov chains, we can use a result of 
Kipnis and Varadhanl (|1986l ). which requires the chain to be reversible, as is the 
case for Metropo lis-Hastings chains, or we must delve into mixing conditions 
( BillingsleJl994 

Secti on 27) which are typically not easy to verify. However, 
Chan and Geverl (|1994h showed how the condition of geometric ergodicity could 
be used to establish CLTs for Markov chains. But getting the convergence is 
only half of the problem. In order to use (|1 -4[) , we must be able to consistently 
estimate the variance, which turns out to be another difficult endeavor. The 
"naive" estimate of the usual standard error is not consistent in the dependent 
case and the most promising paths for consistent variance estimates seems to 
be through regeneration and batch means. 



The theory of regeneration uses the concept of a split chain (jAthreva and Nev 



19781) . and allows us to independently restart the chain while preserving the 



stationary distribution. These independent "tours" then allow the calcula- 
tion of consistent variance estimates and honest monitoring of convergence 
through (11.41). Early work on app lying re g enera tion to MCMC chains was 



done by iMvkland et al.l (| 1995T ) and iRobertl (| 1995T ) , who showed how to con- 
struct the chains and use them for varia nce calculations and diagnostics (see also 



Guihenneuc-Jouvaux and Robert 
gorithms ( Gil ks et al J ll998h . 



1998), as we ll as deriving adaptive MCMC al- 



Rosenthall ( 1995 ) also showed how to construct and 



Jones and Hobert 



use re generative chains, and much of this work is reviewed in 
( 20011). The most inter esting and practic al developments, however, are in 



Hobert et al 



(|2002l ) and 



Jones et al 



(|2006l ). where consistent estimators are 



constructed for Vax[h(6)], allowing valid mo nitoring of convergen ce in chains 



that sat isfy the CLT . Inter estingly, although 
eration 



Hobert et al 



Jones et al 



(|2002l ) uses regen- 



(|2006f ) get their consistent estimators thorough another 



technique, that of cumulative batch means. 
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1.6 Conclusion 

The impact of Gibbs sampling and MCMC on Baycsian statistics was to change 
our e ntire metho d of thinking and attacking problems, representing a paradigm 



shift (jKuhnll 19961 ) . Now, the collection of real problems that we could solve grew 
almost without bound. Markov chain Monte Carlo changed our emphasis from 
"closed form" solutions to algorithms, expanded our impact to solving "real" 
applied problems and to improving numerical algorithms using statistical ideas, 
and led us into a world where "exact" now means "simulated" . 

This has truly been a quantum leap in the evolution of the field of statistics, 
and the evidence is that there are no signs of slowing down. Although the "ex- 
plosion" is over, the current work is going deeper into theory and applications, 
and continues to expand our horizons and influence by increasing our ability to 
solve even bigger and more important problems. The size of the data sets, and 
of the models, for example in genomics or climatology, is something that could 
not have been conceived 60 years ago, when Ulam and von Neumann invented 
the Monte Carlo method. Now we continue to plod on, and hope that the ad- 
vances that we make here will, in some way, help our colleagues 60 years in the 
future solve problems that we cannot yet conceive. 
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